NSF PAR Search | NSF Public Access Repository


Run-length compressed metagenomic read classification with SMEM-finding and tagging

https://doi.org/10.1101/2025.02.25.640119

Depuydt, Lore; Ahmed, Omar Y; Fostier, Jan; Langmead, Ben; Gagie, Travis (February 2025, bioRxiv)

Abstract Metagenomic read classification is a fundamental task in computational biology, yet it remains challenging due to the scale, diversity, and complexity of sequencing datasets. We propose a novel, run-length compressed index based on the move structure that enables efficient multi-class metagenomic classification in O(r) space, where r is the number of character runs in the BWT of the reference text. Our method identifies all super-maximal exact matches (SMEMs) of length at least L between a read and the reference dataset and associates each SMEM with one class identifier using a sampled tag array. A consensus algorithm then compacts these SMEMs with their class identifier into a single classification per read. We are the first to perform run-length compressed read classification based on full SMEMs instead of semi-SMEMs. We evaluate our approach on both long and short reads in two conceptually distinct datasets: a large bacterial pan-genome with few metagenomic classes and a smaller 16S rRNA gene database spanning thousands of genera or classes. Our method consistently outperforms SPUMONI 2 in accuracy and runtime while maintaining the same asymptotic memory complexity of O(r). Compared to Cliffy, we demonstrate better memory efficiency while achieving superior accuracy on the simpler dataset and comparable performance on the more complex one. Overall, our implementation carefully balances accuracy, runtime, and memory usage, offering a versatile solution for metagenomic classification across diverse datasets. The open-source C++11 implementation is available at https://github.com/biointec/tagger under the AGPL-3.0 license.
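The pipeline the abstract outlines (find SMEMs of length at least L, tag each with one class, then reduce the tags to a single call per read) can be illustrated with a toy Python sketch. This is not the authors' implementation: it replaces the run-length compressed index and sampled tag array with brute-force substring search over a plain dictionary, and both the tie-breaking rule for tags and the length-weighted vote used as the consensus step are assumptions made for illustration. The names `find_smems`, `classify`, and `references` are hypothetical.

```python
from collections import defaultdict


def find_smems(read, references, min_len):
    """Return (start, end, class_id) for every SMEM of length >= min_len.

    `references` maps a class id to a reference string. It is a toy stand-in
    for the compressed index; matching is done by brute-force substring search.
    """
    smems = []
    prev_end = 0  # end of the longest match starting one position earlier
    for start in range(len(read)):
        # Extend to the right while the substring still occurs in any reference.
        end = start
        while end < len(read) and any(
            read[start:end + 1] in ref for ref in references.values()
        ):
            end += 1
        # The match is right-maximal by construction. It is also left-maximal
        # (hence an SMEM) only if it reaches past the previous match's end.
        if end > start and end > prev_end and end - start >= min_len:
            match = read[start:end]
            # Assumption: tag with the first class (sorted order) containing it.
            class_id = next(c for c in sorted(references) if match in references[c])
            smems.append((start, end, class_id))
        prev_end = max(prev_end, end)
    return smems


def classify(read, references, min_len):
    """Assumed consensus: the class with the most matched bases wins."""
    votes = defaultdict(int)
    for start, end, class_id in find_smems(read, references, min_len):
        votes[class_id] += end - start
    if not votes:
        return None  # no SMEM of sufficient length, read stays unclassified
    # Iterating over sorted keys makes ties resolve deterministically.
    return max(sorted(votes), key=votes.get)


references = {"A": "ACGTACGTTTGA", "B": "GGGCCCAAATTT"}
read = "ACGTTTGACCCAAA"
print(find_smems(read, references, 4))  # [(0, 8, 'A'), (8, 14, 'B')]
print(classify(read, references, 4))    # A
```

In this example the read yields one 8-base SMEM tagged A and one 6-base SMEM tagged B, so the length-weighted vote assigns the read to class A. The short match "AC" at position 7 is left- and right-maximal but is dropped by the length threshold.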
Free, publicly-accessible full text available February 28, 2026
Movi: A fast and cache-efficient full-text pangenome index

https://doi.org/10.1016/j.isci.2024.111464

Zakeri, Mohsen; Brown, Nathaniel K; Ahmed, Omar Y; Gagie, Travis; Langmead, Ben (December 2024, iScience)

Free, publicly-accessible full text available December 1, 2025
SPUMONI 2: improved classification using a pangenome index of minimizer digests

https://doi.org/10.1186/s13059-023-02958-1

Ahmed, Omar Y.; Rossi, Massimiliano; Gagie, Travis; Boucher, Christina; Langmead, Ben (May 2023, Genome Biology)

Abstract Genomics analyses use large reference sequence collections, like pangenomes or taxonomic databases. SPUMONI 2 is an efficient tool for sequence classification of both short and long reads. It performs multi-class classification using a novel sampled document array. By incorporating minimizers, SPUMONI 2’s index is 65 times smaller than minimap2’s for a mock community pangenome. SPUMONI 2 achieves a speed improvement of 3-fold compared to SPUMONI and 15-fold compared to minimap2. We show SPUMONI 2 achieves an advantageous mix of accuracy and efficiency in practical scenarios such as adaptive sampling, contamination detection and multi-class metagenomics classification.
